1  Introduction

1.1  Approaches  to  Distributed  Simulation 

One of the advantages of computers is that they can be used to simulate
systems that cannot be observed directly. Parallel computers are used
extensively for simulation. Many parallel simulators are time-driven: At the
k-th step, where k > 0, the simulator computes the state at time k of all
processes in the physical system (i.e., the system being simulated). This paper
is concerned with event-driven distributed simulation, in which the simulator
may compute the behavior of different subsystems up to different points in
time. There are two common approaches to event-driven distributed
simulation: the conservative approach [1, 7] and the optimistic approach [6].
A computation in the conservative approach proceeds by determining the
correct behavior of each physical process in a time interval. A computation
in the optimistic approach may compute an incorrect behavior that is
corrected later by rollback and recovery. This paper presents a new
conservative approach that appears to be efficient. Performance measurements of
the simulator, written in Cosmic C and running on an Intel iPSC/1, are provided.

’On  leave  of  absence  from  The  University  of  Texas  at  Austin. 


1.2  Our  Point  of  Departure 

Most  of  the  distributed  simulators  described  in  the  literature  are  designed 
so  that  the  structure  of  the  simulator  mirrors  the  structure  of  the  physical 
system:  A  message  between  a  pair  of  processes  in  the  simulator  represents 
an  event  that  changes  the  states  of  the  corresponding  pair  of  processes  in 
the physical system. By contrast, in our method messages are used for a
variety of purposes, and the process-interconnection structure of the simulator
need  not  represent  the  structure  of  the  physical  system.  There  is  no  reason 
to  constrain  the  structure  of  the  simulator  program  to  reflect  the  physical 
world.  Our  goal  is  to  exploit  the  architecture  of  a  parallel  computer  to  obtain 
fast  execution,  and  the  relevant  yardstick  is  execution  time.  At  an  abstract 
level, distributed simulation algorithms are suitable for all parallel
architectures, but there are differences at the level of detailed implementation
that  impact  performance.  The  efficiency  of  a  simulation  program  depends 
on  the  characteristics  of  the  target  computer:  the  underlying  architecture 
(message-passing  multicomputer,  shared-memory  multiprocessor  [11]),  the 
amount of memory, the number of processors, the speed of
synchronization, and process switch time. The question of interest is how appropriate a
simulation  algorithm  is  for  a  given  application  and  a  given  target  computer. 

Another  difference  between  the  method  proposed  here  and  those  in  the 
literature  is  our  use  of  conditional  events.  The  event  list  in  a  sequential 
simulation  is  a  list  of  conditional  events:  The  meaning  of  an  event  in  the 
event list is that it is the next event to be executed provided there is no
earlier event. For example, a preemptive priority queue serving a low-priority
customer  with  a  remaining  service  time  of  T  units  will  complete  service  of 
the  customer  T  units  later,  provided  no  higher  priority  jobs  arrive  while  the 
low-priority  customer  is  being  served.  The  earliest  event  in  the  event  list  is 
the  next  event  to  be  executed,  even  though  it  is  a  conditional  event.  Thus 
the  event  list  is  a  mechanism  for  determining  definite  events  (i.e.,  events 
that  occur  in  the  system  being  simulated)  from  conditional  events.  We 
employ  conditional  events  in  much  the  same  way  that  they  are  employed 
in  sequential  simulation.  We  also  use  other  ways  of  determining  definite 
events. Consider again a preemptive priority queue serving a
highest-priority customer with a remaining service time of T units. This customer will
depart after T time units no matter what happens in the future. Thus we
can determine that the departure of the highest-priority customer is a
definite event without using the event list. A critical element of the success of
our method is determining as many definite events as possible without using
the fact that the earliest conditional event is also a definite event. The use
of conditional events guarantees absence of deadlocks, and furthermore we
need not use null messages to guarantee progress [7].

2  The  Physical  System 

2.1  Physical  Processes 

Our  goal  is  to  simulate  a  system  on  a  parallel  computer.  The  system  that  is 
to  be  simulated  is  called  the  physical  system,  and  a  process  in  the  physical 
system  is  called  a  physical  process  (or  PP)  in  contrast  to  a  process  in  the 
simulator,  which  is  called  a  logical  process  (or  LP). 

In  the  physical  system,  time  is  integer-valued.  Initially  time  has  value 
0.  We  are  required  to  determine  the  behavior  of  the  physical  system  at  all 
times in the interval [0, H], where H is the horizon of the simulation.

A  set  of  input  ports  and  a  set  of  output  ports  are  associated  with  each 
PP;  inputs  to  the  PP  are  received  along  its  input  ports,  and  outputs  from 
the  PP  are  sent  along  its  output  ports.  The  state  and  the  outputs  of  a  PP 
at time t + 1 are functions of its state and its inputs at time t, for all t where
t ≥ 0. Thus a PP is defined by a set of input ports, a set of output ports, a
set  of  states,  an  initial  state  and  a  next-state  function,  and  an  initial  output 
and  a  next-output  function  for  each  output  port. 

2.2  A  System  of  Physical  Processes 

A physical system is a set of PPs and a set of connections, where each
connection connects one output port of a PP to one input port of another PP.
The value of an input port is (always) equal to the value of the output port
that it is connected to. The state of a physical system is given by the states
of its PPs and the values of the ports of its PPs. Thus the state of a physical
system at time t + 1 is a function of its state at time t, for all t where t ≥ 0.
The initial system state is determined by the initial states of its component
PPs. The problem is to compute the state of the physical system for all
times t in the interval [0, H].

A simple solution is to compute the state of the physical system for
increasing values of t, for all t in [0, H]. In this paper we design a parallel
solution.
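
The simple sequential solution can be sketched as follows. This is only an illustration under stated assumptions: the tuple encoding of a PP, the function names, and the toy "adder" process are hypothetical and do not come from the paper; states and port values are taken to be integers for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding of a PP (all names here are illustrative):
#   (input_ports, initial_state, initial_outputs, step)
# where step(state, inputs) -> (next_state, outputs) and inputs/outputs
# are dicts keyed by port name.
Step = Callable[[int, Dict[str, int]], Tuple[int, Dict[str, int]]]
PP = Tuple[List[str], int, Dict[str, int], Step]


def simulate_sequentially(pps: Dict[str, PP],
                          connections: Dict[str, str],
                          horizon: int):
    """Compute the system state for t = 0, 1, ..., horizon in increasing order.

    connections maps each input port to the output port it is connected to.
    Returns history, where history[t] = (PP states, output-port values) at time t.
    """
    states = {name: pp[1] for name, pp in pps.items()}
    outputs: Dict[str, int] = {}
    for pp in pps.values():
        outputs.update(pp[2])
    history = [(dict(states), dict(outputs))]
    for _ in range(horizon):
        new_states, new_outputs = {}, {}
        for name, (in_ports, _, _, step) in pps.items():
            # An input port always carries the value of its connected output port.
            inputs = {r: outputs[connections[r]] for r in in_ports}
            new_states[name], out = step(states[name], inputs)
            new_outputs.update(out)
        states, outputs = new_states, new_outputs
        history.append((dict(states), dict(outputs)))
    return history


def make_adder(out_port: str) -> Step:
    """Toy PP (an assumption): add the inputs to the state, output the new state."""
    def step(state, inputs):
        new_state = state + sum(inputs.values())
        return new_state, {out_port: new_state}
    return step


pps = {
    "a": (["ra"], 1, {"sa": 1}, make_adder("sa")),
    "b": (["rb"], 0, {"sb": 0}, make_adder("sb")),
}
connections = {"ra": "sb", "rb": "sa"}  # a ring of two PPs
history = simulate_sequentially(pps, connections, horizon=3)
```

Every PP is advanced by exactly one time unit per pass, which is the time-driven scheme that the parallel algorithm is meant to improve on.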

All  PPs  and  all  ports  have  unique  names. 
3  The  Simulator

3.1  Overview  of  the  Algorithm 

The simulator is described for a message-passing multicomputer
implementation in which messages are delivered in the order sent. We first describe
a  synchronous  version  of  the  algorithm  and  later  describe  an  asynchronous 
version. 

The simulator has one logical process (LP) for each PP. Let PP X be
simulated by LP X, for all X. Assume, for the time being, that the simulator
network is fully connected, i.e., each LP can send messages to all LPs.

The basic operation of LP X is a repetition of the following loop:
begin-loop

1.  Determine  inputs  to  PP  X  from  initial  conditions  and  messages  it  has 
received  from  other  LPs. 

2.  Compute  outputs  from  PP  X  as  far  into  the  future  as  possible  given 
the  inputs  to  PP  X  that  have  been  determined  so  far,  and  send  a 
message  to  each  LP. 

3.  Wait  to  receive  a  message  from  each  LP. 

end-loop 

The  algorithm  is  synchronous  in  the  sense  that  an  LP  receives  messages 
from  all  other  LPs  on  each  execution  of  the  loop.  The  synchronous  algorithm 
is  simpler  than  the  asynchronous  version  in  which  an  LP  does  not  wait  to 
receive  a  message  from  every  other  LP,  and  so  we  describe  the  synchronous 
version  first. 

3.2  Local  Variables  of  an  LP 

An  LP  X  has  the  following  local  (integer)  variables: 

• a time u_r for each input port r

• a time u_s for each output port s

• a set E_X of pairs (op, msg) where op is an output port of PP X and
msg is a message sent along op

• a time next_X

• a time C_X[Y] for each LP Y

The local variables of LP X have subscript X, r, or s; this makes it easier
to describe the algorithm because each variable used in the algorithm has a
distinct name (because all processes and ports have unique names).

Local variables u_r and u_s have the following meaning:

• LP X has received messages from another LP defining the input to
PP X along port r in the interval [0, u_r], and

• LP X has sent messages to another LP defining the output from PP
X along port s in the interval [0, u_s].

Variable C_X[Y] is a buffer to store the next_Y field of messages received
from LP Y; all messages from LP Y contain a next_Y field, and when LP
X receives a message from LP Y it stores the value of the next_Y field of
the message in C_X[Y]. An invariant of the program is that if there are no
messages in transit from LP Y to LP X then C_X[Y] = next_Y. Similarly,
u_r may be thought of as a buffer for u_s, where ports r and s are connected.
Let s be an output port of PP Y connected to input port r of PP X. Every
message sent by LP Y to LP X contains the value of u_s, and when LP X
receives the message it stores this value in u_r. An invariant of the program
is that if there are no messages in transit from LP Y to LP X then u_r = u_s.
Variables u_r and u_s are monotone nondecreasing.

The Earliest Conditional Event. Let cond_s be the time at which
the next message is sent by PP X along output port s after time u_s if there
are no inputs along r after time u_r, for all input ports r of PP X. Define
next_X as the minimum of cond_s over all output ports s of PP X. Define E_X
as the set of pairs (op, msg) where PP X sends message msg along output
port op at time next_X if there are no inputs along r after time u_r, for all
input ports r of PP X.

Example. Consider a preemptive priority queue with one input port r
through which all customers arrive, and one output port s through which all
customers depart. We treat the queue as a PP and customers as messages.
Since there is only one output port, next_X = cond_s. Initially u_r = 0 and
u_s = 0. Customers in the queue have one of two priorities: high or low.
We shall show that if initially the customer at the head of the queue has a
service time of T units, then cond_s = T, and if initially the queue is empty,
then cond_s = ∞.

Suppose the customer at the head of the queue is a high-priority
customer. Then this customer will leave the queue at time T, no matter what
future inputs are. Therefore, in this case, cond_s = T, and E_X is {(s, HI)},
where HI represents the departure of a high-priority customer.

Now suppose the customer at the head of the queue is a low-priority
customer (and in this case there are no high-priority customers in the queue
initially). If a high-priority customer arrives before T, then the low-priority
customer is preempted and must wait at least until the arriving high-priority
customer finishes service. If no high-priority customers arrive at the queue
before T, then the low-priority customer will leave the queue at time T.
Therefore, cond_s = T because cond_s is the time of the next departure after
time u_s along port s if there are no arrivals along port r after time u_r. In this
case, E_X is {(s, LO)}, where LO represents the departure of a low-priority
customer.

Finally, consider the case where the queue is empty initially. No customer
leaves the queue if no customer enters the queue, and hence cond_s = ∞.
In this case, E_X is {(s, null)}, where null represents the departure of no
customers.

The meaning of cond_s is an extension of the meaning of the time of a
conditional event in sequential simulation.
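
The three cases of the example can be written as one small function. This is a sketch, not code from the paper: the function name, the list encoding of the queue (customer in service first), and the extra `is_definite` flag are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def earliest_conditional_event(queue: List[Tuple[str, int]], now: int = 0):
    """Return (cond_s, message, is_definite) for a preemptive priority queue.

    queue holds (priority, remaining_service) pairs with the customer in
    service first; priority is "HI" or "LO". `now` plays the role of u_s.
    """
    if not queue:
        # Nobody leaves unless somebody arrives: cond_s is infinite, E = {(s, null)}.
        return math.inf, "null", False
    priority, remaining = queue[0]
    # A high-priority departure cannot be preempted, so it is definite
    # without consulting any other LP; a low-priority one stays conditional.
    return now + remaining, priority, priority == "HI"


high = earliest_conditional_event([("HI", 13)])
low = earliest_conditional_event([("LO", 17)])
empty = earliest_conditional_event([])
```

In both non-empty cases cond_s equals T; only the flag differs, which is exactly the distinction between a definite event found directly and one that must wait for the earliest-event test.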

In addition to the variables described here, an LP has other local
variables to carry out the simulation of the corresponding PP and to gather
statistics; at this time we choose to ignore these variables and to focus
attention on the variables required for communication.

Note: The symbol u stands for upto: Communication along a port r
has been computed upto time u_r. The symbol s is used for output ports
because it stands for sending port; similarly, the symbol r is used for input
ports because it stands for receiving port.

3.3  Messages  Exchanged  by  LPs 

A  message  sent  by  an  LP  X  to  an  LP  Y  contains  the  following  information: 

• For each output port s of PP X that is connected to an input port of
PP Y: A pair (s, D_s) where D_s describes the output along port s of
PP X after time u_s.

• The time next_X defined earlier.

We now define D_s.

Representing Physical Communication. A sequence of outputs
after time u_s along an output port s of a PP is represented in the simulator
by the sequence D_s of pairs (t[i], e[i]), 0 < i ≤ K, where K ≥ 0, t[i] is a time
where t[i] > u_s, and e[i] is either an output along s or the special symbol
null. Sequence D_s is ordered in increasing order of t[i] and represents the
outputs along s in the interval (u_s, t[K]] in a simulation in which:

• e[i] is output along s at time t[i] if e[i] is not null, and nothing is
output along s at t[i] if e[i] is null; and

• nothing is output along s in the interval (u_s, t[K]] at times other than
t[i], 0 < i ≤ K.

Concurrent with sending the message containing D_s, u_s is assigned the value
t[K]; and when an LP receives a message containing D_s, u_r is assigned the
value t[K], where r is the input port connected to s. When D_s is empty
(K = 0), u_s and u_r are left unchanged. Therefore, u_s and u_r
are monotone nondecreasing.

Example. Consider a simulation in which for some output port s of
a PP X the following values are output after time 3: B at time 5, C at
time 10, B again at time 20; and there are no other outputs along s in the
interval (3, 30]. With u_s = 3, the outputs along port s in the interval (3, 30]
can be represented in the simulation by a message containing the following
sequence: (5, B), (10, C), (20, B), (30, null). Concurrent with the sending of
the message, u_s becomes 30; and when an LP receives the message, it sets
u_r to 30.
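
One possible way to build such a sequence is sketched below. The function name, the dict encoding of outputs, and the use of `None` for the null symbol are assumptions for illustration; the paper does not prescribe a representation.

```python
from typing import Dict, List, Optional, Tuple


def encode_outputs(outputs: Dict[int, str], u_s: int, upto: int):
    """Build D_s for the definite outputs in (u_s, upto]; return (D_s, new u_s).

    outputs maps a time to the value output along port s at that time.
    None stands for the special symbol null.
    """
    d: List[Tuple[int, Optional[str]]] = [
        (t, outputs[t]) for t in sorted(outputs) if u_s < t <= upto
    ]
    # A trailing (upto, null) pair records that nothing further is output
    # up to and including time upto.
    if upto > u_s and (not d or d[-1][0] < upto):
        d.append((upto, None))
    # u_s advances to the last time in the sequence; an empty sequence
    # leaves it unchanged, so u_s never decreases.
    new_u_s = d[-1][0] if d else u_s
    return d, new_u_s


d_s, u_s = encode_outputs({5: "B", 10: "C", 20: "B"}, u_s=3, upto=30)
u_r = d_s[-1][0]  # the receiver sets u_r to the last time in D_s
```

With the values of the example this yields the sequence (5, B), (10, C), (20, B), (30, null) and u_s = u_r = 30.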

3.4  The  Algorithm 

Initially, for all input ports r: u_r = 0, and for all output ports s: u_s = 0. For
simplicity, assume that no messages are sent in the physical system at time
0; therefore, assume that initially all LPs have received messages (0, null)
for each input port and have received messages with next_Y = 0 from all LPs
Y. Hence, initially next_X = 0, C_X[Y] = 0, and E_X = {(s, null)} where s is
an output port of PP X.

An  LP  X  repeatedly  executes  the  following  loop: 


begin-loop

1. Obtain definite events from conditional events

If equation 1 (below) holds,

next_X = minimum over all Y of C_X[Y],    (1)

then msg is (definitely) output by PP X along port op at time next_X
for all pairs (op, msg) in E_X.

Compute as many definite events as possible

The messages received by LP X describe the input to PP X along
port r upto time u_r, for all r. Use the initial conditions, the messages
received, and the definite events obtained from conditional events to
compute the (definite) output of PP X as far into the future as
possible.

2. Message sending

Update next_X and E_X, send a message to each LP Y, and update u_s.
The message contains the (updated) value of next_X and, for each output
port s of PP X connected to an input port of PP Y, the value (s, D_s).

3. Message receiving

Wait to receive a message from each LP. Upon receiving a message
from LP Y, set C_X[Y] to the next_Y field of the message, and update
u_r for all input ports r of PP X connected to PP Y.

end-loop 

Example. Consider a cyclic queueing network with three preemptive
priority queues q0, q1, q2, where the outputs of q0, q1, q2 are the inputs to q1,
q2, q0, respectively. There are two priorities of customers, high and low,
and priorities do not change. The system contains three customers who
remain in the system forever; customers do not enter or leave. Initially q0
contains a high-priority customer with a service time of 13 units, q1 contains
a low-priority customer with a service time of 17 units, and q2 contains a
low-priority customer with a service time of 11 units. For simplicity, assume
that a customer's service time is the same in all queues. Let ri and si be
the input and output ports (respectively) of queue i, i = 0, 1, 2. The first
few iterations of the simulation are given in Table 1.

The table shows the values of variables at the end of each iteration. Now
we shall discuss the first iteration in detail.
iteration              0       1          2          3

queue 0   next         0       13         ∞          24
          E            null    HI         null       LO
          u_r0         0       0          11         11
          u_s0         0       13         13         13
          D            ()      (13, HI)   ()         ()

queue 1   next         0       17         26         30
          E            null    LO         HI         LO
          u_r1         0       13         13         13
          u_s1         0       0          26         26
          D            ()      ()         (26, HI)   ()

queue 2   next         0       11         11         39
          E            null    LO         LO         HI
          u_r2         0       0          0          26
          u_s2         0       0          11         39
          D            ()      ()         (11, LO)   (39, HI)

Table 1: A simulation


A high-priority customer will leave queue 0 at time 13 no matter what
future inputs may be. Therefore, LP 0 sends a message with D_s0 = (13, HI)
to LP 1 and concurrently sets u_s0 to 13. Since a high-priority job leaves at
time 13, next_0 = 13 and E_0 = {(s0, HI)}. Since LP 0 receives a message
from LP 2 with D_s2 being the empty sequence, u_r0 remains unchanged at 0.

Now consider queue 1. A low-priority customer will depart the queue at
time 17 if no high-priority customer arrives earlier. Therefore, next_1 = 17
and E_1 = {(s1, LO)}. LP 1 sends a message with D_s1 = () to LP 2,
and leaves u_s1 unchanged because no definite output from queue 1 can be
predicted. The behavior of queue 2 is similar.

On the second iteration, LP 2 converts the conditional event of a
departure of a low-priority job at time 11 into a definite event because it has the
smallest value of next. The remainder of the table is straightforward.
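
The test of equation 1 that LP 2 applies on the second iteration can be sketched as below. The function and variable names are hypothetical, and the sketch assumes that C_X[Y] also holds LP X's own next value (the asynchronous version states C_X[X] = next_X explicitly). The numbers are the next values at the end of the first iteration of the example.

```python
from typing import Dict


def can_convert(next_x: float, c_x: Dict[int, float]) -> bool:
    """Equation 1: next_X equals the minimum over all Y of C_X[Y]."""
    return next_x == min(c_x.values())


# next values at the end of iteration 1: queue 0 -> 13, queue 1 -> 17, queue 2 -> 11.
next_after_iteration_1 = {0: 13, 1: 17, 2: 11}

# After the message exchange of iteration 1, every LP holds the same buffer C.
c = dict(next_after_iteration_1)

converting = [x for x, nx in next_after_iteration_1.items() if can_convert(nx, c)]
```

Only LP 2 passes the test, so only its conditional departure at time 11 becomes definite on the second iteration.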

3.5  Outline  of  Proof  of  Correctness

Safety. We are required to prove the invariant that all outputs of LPs
are outputs of PPs. For our program this reduces to showing that the
conversion of conditional events to definite events is correct. We shall show
that if next_X = minimum over all Y of C_X[Y], then the outputs in E_X are
sent at time next_X in PP X.

We shall use the superscript k to denote the value of a variable at the
end of the k-th iteration. From the program:

C_X^k[Y] = next_Y^k.    (2)

u_r^k = u_s^k, for all ports r, s connected to each other.    (3)

Define T as follows:

T = minimum over all Y of C_X^k[Y].    (4)

From equation 2 and the above:

T = minimum over all Y of next_Y^k.    (5)

We shall show by induction on t that for all t where t < T, there is no
output (in the physical system) along port s in the interval (u_s^k, t] for all
output ports s in the system, and there is no input along port r in the
interval (u_r^k, t] for all input ports r in the system.

Base Case. The induction hypothesis holds vacuously for t = 0.

Induction Step. Assume the hypothesis true for all times up to and
including t - 1, where t < T, and we shall prove it true for t. For all
input ports r there is no input in the interval (u_r^k, t - 1], from the induction
hypothesis. From the meaning of next_Y, for all t where u_s^k < t < next_Y^k,
there is no output along port s of PP Y at time t if for all input ports r of PP
Y there is no input along port r in the interval (u_r^k, t - 1]; from equation 5
the same holds for all t where u_s^k < t < T. Therefore, for all output ports
s there is no output at time t where u_s^k < t < T. Employing the induction
hypothesis, there is no output along port s in the interval (u_s^k, t]. From
equation 3 there is no output along port s in the interval (u_r^k, t]. Since the
value of an input port is equal to the value of the output port that it is
connected to, there is no input along an input port r in the interval (u_r^k, t].
This completes the proof by induction.

Let next_X^k = T. We have shown that for all input ports r of PP X there
are no inputs in the interval (u_r^k, T). From the meaning of next_X^k it follows
that at time next_X^k, PP X outputs msg along port op for all (op, msg) in
E_X^k.


Progress. For the k-th iteration, for all k, there is at least one LP X such
that:

next_X^k = minimum over all Y of next_Y^k.    (6)

This LP determines at least one definite event. This event is output on
the k-th iteration if it can be determined that the event is definite without
using equation 6; otherwise, it is output on the (k+1)-th iteration. Thus at
least one LP outputs an event at least once in every two iterations. Since
the outputs along a port are monotone increasing in simulated time, the
simulation progresses.

4  Asynchronous  Version 

In  the  synchronous  version,  an  LP  receives  messages  from  all  other  LPs  and 
sends messages to all other LPs in each iteration. Now consider an
asynchronous version in which there are no such iterations. An LP computes the
output  of  the  corresponding  PP  as  far  into  the  future  as  possible,  and  sends 
this  output  to  the  appropriate  LPs.  When  an  LP  receives  a  message  from 
any  other  LP,  it  computes  further  output  of  the  PP  and  sends  the  output. 
Thus  an  LP  need  not  wait  to  receive  messages  from  all  other  LPs  before 
computing further and sending messages. Without an additional mechanism,
however, the simulation may not progress, because all LPs may be idle,
waiting to receive messages. Conditional events can be used to avoid
this deadlock.

As  long  as  there  are  messages  waiting  in  an  LP’s  input  buffer,  the  LP 
does  not  use  conditional  events.  Conditional  events  are  used  only  when 
the LP would be otherwise idle. The proof of the synchronous version
suggests how the asynchronous program can be structured. The proof rests on
equations  2  and  3,  and  these  equations  play  a  key  role  in  the  design  of  the 
asynchronous  algorithm.  Our  goal  is  to  design  an  algorithm  in  which  an  LP 
X  can  determine  whether  equations  2  and  3  (or  equivalent  equations)  hold. 
We  now  present  an  algorithm  and  later  discuss  its  proof. 

The Algorithm. Each LP Y records next_Y and, for each of its input
ports r, the number of messages n_r it has received along r, and, for each
of its output ports s, the number of messages n_s it has sent along s. The
recording must be an atomic action in the sense that the values of next_Y, n_r,
and n_s must not change during the recording. The recording is carried out
at arbitrary times with the constraint that an LP re-records its values some
finite time after they change. LP Y broadcasts (also at arbitrary times) the
values it has recorded, again with the proviso that it re-broadcasts values
some finite time after they change. An LP X has local variables C_X[Y],
D_X[r], and D_X[s] in which it retains the last values received of next_Y, n_r,
and n_s, respectively, for all LPs Y where Y ≠ X, and all ports r, s of PPs in
the system other than PP X itself. LP X guarantees that C_X[X] = next_X,
and for all its input ports r, D_X[r] = n_r, and for all its output ports s,
D_X[s] = n_s. (Therefore, LP X does not send messages to itself.)

Conditional events are converted to definite events as follows. We are
given the conditional event that PP X outputs the messages in E_X at time
next_X if for each of its input ports r, it receives no message along r after
time u_r. If equation 1 and equation 7 (below) hold, then PP X (definitely)
outputs the messages in E_X at time next_X.

for all ports r, s connected to each other: D_X[r] = D_X[s].    (7)

Note that all the variables named in equations 1 and 7 are local to LP X.
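
Because both equations involve only local variables, the test is a purely local computation. The sketch below is illustrative only: the function name, the dict encodings, and the port names of a three-queue ring are assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple


def can_convert_async(next_x: float,
                      c_x: Dict[str, float],
                      d_x: Dict[str, int],
                      connections: List[Tuple[str, str]]) -> bool:
    """Check equations 1 and 7 using only LP X's local variables.

    c_x[Y]  : last recorded next value known for LP Y (including X itself)
    d_x[p]  : last recorded message count known for port p
    connections: (output port s, input port r) pairs of the whole system
    """
    eq1 = next_x == min(c_x.values())
    eq7 = all(d_x[r] == d_x[s] for s, r in connections)
    return eq1 and eq7


ring = [("s0", "r1"), ("s1", "r2"), ("s2", "r0")]
c_x = {"LP0": 13, "LP1": 17, "LP2": 11}

# Every recorded send count matches the recorded receive count.
quiet = {"s0": 1, "r1": 1, "s1": 1, "r2": 1, "s2": 1, "r0": 1}
# One message sent along s1 has not yet been counted as received along r2.
in_transit = dict(quiet, s1=2)

ok = can_convert_async(11, c_x, quiet, ring)
blocked = can_convert_async(11, c_x, in_transit, ring)
```

When a count mismatch is recorded, the LP simply waits for fresher recordings rather than converting the event.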

Outline  of  Proof.  The  proof  is  based  on  the  concept  of  global  snapshots 
[2,  3].  A  global  snapshot  is  a  state  of  the  system  (and  in  this  case  the  system 
is  the  simulator)  that  could  have  occurred  earlier.  If  each  process  records  its 
local  state,  the  messages  it  has  sent  on  each  output  port,  and  the  messages 
it  has  received  on  each  input  port,  then  the  collection  of  local  recordings  is 
a  global  state  if  and  only  if  the  number  of  messages  sent  along  each  output 
port  is  greater  than  or  equal  to  the  number  of  messages  received  along  the 
input  port  that  it  is  connected  to.  (We  assume  that  initially  there  are  no 
messages  in  transit.)  The  state  of  a  channel  from  an  output  port  s  to  an 
input  port  r,  in  the  global  state,  is  the  sequence  of  messages  sent  along  s 
in  the  recording,  excluding  the  sequence  of  messages  received  along  r  in  the 
recording. 
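
The consistency condition and the channel state can be sketched as follows. The names and the list encoding of recorded message sequences are assumptions for illustration; the sketch relies on the stated property that messages are delivered in the order sent, so the received sequence is a prefix of the sent sequence.

```python
from typing import Dict, List, Tuple


def is_global_state(sent: Dict[str, List[str]],
                    received: Dict[str, List[str]],
                    connections: List[Tuple[str, str]]) -> bool:
    """Recordings form a global state iff, on every connection (s, r),
    at least as many messages were recorded as sent as were received."""
    return all(len(sent[s]) >= len(received[r]) for s, r in connections)


def channel_state(sent: Dict[str, List[str]],
                  received: Dict[str, List[str]],
                  s: str, r: str) -> List[str]:
    """Messages sent along s in the recording, excluding those received along r."""
    return sent[s][len(received[r]):]


connections = [("s0", "r1")]
sent = {"s0": ["m1", "m2", "m3"]}
received = {"r1": ["m1"]}

consistent = is_global_state(sent, received, connections)
in_channel = channel_state(sent, received, "s0", "r1")
# A recording in which more was received than sent cannot be a global state.
inconsistent = is_global_state({"s0": ["m1"]}, {"r1": ["m1", "m2"]}, connections)
```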

If equation 7 holds, then the values next_Y = C_X[Y], n_r = D_X[r], and
n_s = D_X[s] form (part of) a global state G, because for all connected pairs
of ports s and r, the number of messages sent along port s in G is equal to
the number of messages received along r in G. In global state G, next_Y =
C_X[Y]; see the similarity to equation 2. Also, for all connected ports r and
s, u_r = u_s in G (see the similarity to equation 3) because n_r = n_s in
G. Now the proof that PP X outputs the messages in E_X at time next_X is
exactly the same as in the synchronous version, with the values of variables
at the k-th iteration in the synchronous version representing the values of
the corresponding variables in global state G in the asynchronous version.
Outline of Proof of Progress. We shall show that the simulation does
not reach a deadlocked state in which all LPs are waiting for messages and
all channels are empty. If the system remains in a state in which there
are no messages in transit, then (from the algorithm), for all LPs X,
equation 7 holds (because n_r = n_s since there are no messages in transit, and
D_X[r] = n_r, D_X[s] = n_s since LP X receives the latest recorded values
from all processes). By the same argument, for LPs X, Y: next_Y = C_X[Y].
Consider the LP, say X, with the smallest value of next. For this LP,
equation 1 holds as well, and hence it converts a conditional event to a real event.
Therefore, at least one LP eventually makes progress.

5  Experimental  Results 

5.1  The  Physical  System 

For  our  initial  experiments,  we  chose  problems  that  appear  to  be  unsuited  for 
conservative  methods.  Experiments  in  the  literature  [8]  suggest  that  conser¬ 
vative  distributed  simulation  methods  are  inefficient  for  queueing  networks 
in  which  the  routes  that  customers  follow  through  networks  are  random. 
Therefore,  for  our  first  set  of  experiments  we  chose  networks  with  several 
switches,  where  each  switch  has  several  incoming  edges  and  several  outgoing 
edges;  a  customer  entering  a  switch  leaves  via  one  of  the  outgoing  edges  with 
a  given  probability.  A  customer  leaving  a  switch  enters  a  tandem  sequence 
of  queues  and  then  enters  another  switch.  Experiments  were  done  both  for 
first-come-first-served  (fcfs)  queues  and  for  preemptive-priority  queues.  In 
the  later  case,  each  customer  in  the  network  is  assigned  a  priority:  either 
high  or  low.  The  network  is  closed,  i,e.,  customers  neither  leave  nor  enter 
the  network,  and  customers  that  are  initially  in  the  network  remain  in  the 
network  forever.  Our  experiments  are  characterized  by  the  number  jV  of 
switches,  the  number  L  of  queues  between  the  output  of  one  switch  and  the 
input  of  another,  the  probabilities  associated  with  each  output  edge  from 
each  switch,  and  the  number  of  customers  in  the  network.  The  last  param¬ 
eter  is  denoted  by  J  in  the  case  of  fcfs  queues  and  by  Ji,Jh,  standing  for 
the  number  of  low-  and  high-priority  customers,  respectively,  in  the  case 
of  preemptive-priority  queues.  A  variety  of  topologies  can  be  simulated  by 
setting  the  probability  of  following  some  output  edges  of  a  switch  to  0,  and 
others  to  1. 
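
The routing behavior of the switches can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the names choose_edge, ring_routing, uniform_routing, and walk are hypothetical, each switch is assumed to have one outgoing edge to every switch, and the L tandem queues on each edge are not modelled.

```python
import random

def choose_edge(probabilities, rng):
    """Pick the index of an outgoing edge according to its probability."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def uniform_routing(num_switches):
    """Every output edge of every switch is equally likely."""
    return [[1.0 / num_switches] * num_switches for _ in range(num_switches)]

def ring_routing(num_switches):
    """Probabilities of 0 and 1 only: switch s always feeds switch (s + 1) mod N."""
    return [[1.0 if t == (s + 1) % num_switches else 0.0
             for t in range(num_switches)]
            for s in range(num_switches)]

def walk(routing, start, hops, rng):
    """Follow one customer through `hops` switches (tandem queues omitted)."""
    path = [start]
    for _ in range(hops):
        path.append(choose_edge(routing[path[-1]], rng))
    return path
```

With ring_routing the path of a customer is fully determined, which shows how setting probabilities to 0 and 1 selects a particular topology; with uniform_routing the path is random.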

A  strong  argument  can  be  made  that  queueing  networks  (or  indeed  any 
system  with  a  large  stochastic  component)  should  not  be  simulated  using 
distributed  simulation.  The  time  taken  to  simulate  stochastic  systems  is 
largely  spent  in  reducing  the  variance  of  the  results,  and  the  most  efficient 
scheme  for  variance  reduction  is  replication.  We  use  queueing  networks 
because  they  serve  as  easily  understood  benchmarks. 

5.2  Implementation 

The synchronous version of the algorithm was implemented in Cosmic C [10]
running  on  an  Intel  iPSC/1.  We  did  not  experiment  with  the  asynchronous 
version,  because  the  synchronous  version  gave  almost  linear  speedup,  and 
the  asynchronous  version  cannot  do  much  better.  A  few  points  about  the 
implementation  are  of  general  interest,  and  are  not  limited  to  the  example 
discussed  here. 

5.2.1  Mapping  the  Physical  System  to  a  Multicomputer 

Process  Switching.  The  time  taken  for  the  operating  system  to  switch 
control  from  one  process  to  another,  in  a  node  of  a  multicomputer,  can  be 
substantial.  We  reduce  process-switching  overhead  by  executing  switching 
within  the  simulation  rather  than  by  calling  the  operating  system. 

Load  Balancing.  An  important  aspect  of  efficiency  in  concurrent  systems 
is  load  balancing:  The  implementation  should  be  designed  so  that  all  nodes 
in the system have about the same amount of load. We mapped the queueing
network  onto  the  computer  to  ensure  that  each  node  was  (roughly)  equally 
active. 
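
One simple way to obtain a roughly balanced mapping is a greedy heuristic: place the heaviest items first, each on the node that currently carries the least load. The sketch below is an assumption for illustration only; the text does not say which mapping procedure was used, and the function name balance and the per-queue load estimates are hypothetical.

```python
import heapq

def balance(loads, num_nodes):
    """Greedy mapping: heaviest item first, always onto the least-loaded node.

    loads: dict mapping an item (e.g., a queue name) to an estimated load.
    Returns a dict mapping each item to a node index.
    """
    heap = [(0.0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = {}
    for item, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, node = heapq.heappop(heap)   # least-loaded node so far
        assignment[item] = node
        heapq.heappush(heap, (total + load, node))
    return assignment
```

For example, balance({"q0": 4, "q1": 3, "q2": 3, "q3": 2}, 2) places q0 and q3 on one node and q1 and q2 on the other, so both nodes carry a load of 6.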

Grain Size. If the execution time between successive message sends is small
compared to message delay, the overhead of communication can negate the speed
gained from concurrent execution. In our experiments, care was taken to
ensure that the process grain size matched the computer. The networks
we  simulated  were  large,  and  the  amount  of  computation  per  message  was 
of  the  same  order  as  message  delay. 

Lookahead. The simulation is written so that each LP determines the behavior
of a PP as far into the future as possible. An LP ceases computation,
and  waits  for  additional  input,  only  if  it  cannot  determine  any  more  outputs 
of  the  PP  that  it  is  simulating. 
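
For a first-come-first-served queue, lookahead can be illustrated by the standard departure-time recurrence: once the arrival times and service times of a batch of customers are known, all of their departure times follow, whatever arrives later. The sketch below assumes a single server and service times that are known in advance; the function name fcfs_departures and its parameters are hypothetical and are not taken from the implementation described here.

```python
def fcfs_departures(arrivals, service_times, free_at=0.0):
    """Departure times for a single-server FCFS queue.

    arrivals: non-decreasing arrival times already known to the LP.
    service_times: service time of each customer, in arrival order.
    free_at: time at which the server finishes work already in progress.
    """
    departures = []
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, free_at)   # wait for the server if it is busy
        free_at = start + service
        departures.append(free_at)
    return departures
```

For arrivals at times 0, 1, and 5 with service times 2, 2, and 1, the departures are at times 2, 4, and 6; an LP can emit all three outputs before it receives any further input.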

Caveat. The success of our experimental results is probably based on the
care with which the problem was mapped onto the iPSC/1. Poor results
may be obtained from a simulation in which:

•  the overhead of process-switch time plays a significant role in the
execution time of the simulator, or

•  the  loads  across  nodes  of  the  computer  are  significantly  out  of  balance, 
or 

•  there  is  a  mismatch  between  the  process  grain  size  and  the  target 
computer,  or 

•  LPs  do  not  look  ahead. 

5.2.2  Unifying  Conservative  Methods 

A  goal  of  our  experiments  is  to  unify  conservative  methods  on  the  one  hand, 
and  ideas  that  have  been  developed  in  sequential  simulation  on  the  other. 
We  employ  null  events  by  which  one  LP  informs  another  how  far  it  has 
progressed  in  the  simulation,  and  in  this  sense,  our  method  is  similar  to 
null  event  schemes  in  the  literature;  but  we  do  not  depend  on  null  events 
to  guarantee  progress.  The  speedup  obtained  does  depend  significantly  on 
the  use  of  null  messages.  We  employ  conditional  events  in  much  the  same 
way  that  they  are  employed  in  sequential  simulation,  and  in  this  sense,  our 
method  is  an  extension  of  sequential  simulation.  The  use  of  conditional 
events  can  be  thought  of  as  one  way  of  breaking  deadlock;  and  in  this  sense, 
our  method  has  some  similarity  to  deadlock-breaking  schemes,  except  that 
there  is  no  need  for  a  deadlock-detection  algorithm. 

A goal of our (continuing) work on efficient implementations of
distributed simulation is a search for paradigms that unify apparently
disparate ideas.

5.3  Results 

Experiments were done for networks of N = 12 and N = 24 switches, with
path lengths L = 5 and 10. For fcfs queues, we experimented with the
following numbers of customers J: N x 10, N x 50, and N x 100. In the case
of priority queues, experiments were done with J_l, J_h (the number of low-
and high-priority customers, respectively) as follows: N x (10, 10),
N x (50, 10), and N x (90, 10).

Each  of  the  above  networks  was  simulated  once  with  equal  probabilities 
for  each  output  port  of  each  switch  and  once  with  different  probabilities  for 
each  output  port  (e.g.,  for  N  =  12,  probabilities  range  from  0  to  0.2).  The 
results  were  very  similar  for  the  same  network  for  the  two  cases  (i.e.,  equal 
and  different  probabilities). 

Speedup was computed relative to the execution time for one simulation
process. For this example, the sequential version resulting from the
distributed algorithm was at least as efficient as a central event queue
implementation.

Figures  1  through  3  and  5  through  7  summarize  the  speedups  computed 
for  the  fcfs  networks  of  size  N  =  12  and  N  =  24,  respectively,  using  M  =  2 
to  24  processors.  Figures  4  and  8  give  the  maximal  speedups  computed  for 
these  networks  for  different  values  of  J,  for  N  =  12  and  N  =  24,  respectively. 
Figures  9  through  11  summarize  the  speedups  computed  for  the  priority 
queueing networks with N = 24, and Figure 12 gives the maximal speedups
computed for this network for the different values of J_h, J_l.
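
The speedup measure used in this section, execution time for one simulation process divided by execution time on M processors, and the maximal speedup over M can be written as a short sketch. The function names are hypothetical, and the timings in example_times are placeholders invented for illustration; they are not measurements from these experiments.

```python
def speedup(time_one_process, time_m_processors):
    """Speedup relative to the run with a single simulation process."""
    return time_one_process / time_m_processors

def maximal_speedup(times_by_m):
    """Return (M, speedup) for the processor count with the largest speedup.

    times_by_m maps a processor count M to an execution time and must
    contain an entry for M = 1, which serves as the baseline.
    """
    baseline = times_by_m[1]
    best_m = max(times_by_m, key=lambda m: speedup(baseline, times_by_m[m]))
    return best_m, speedup(baseline, times_by_m[best_m])

# Placeholder timings, invented for illustration; not measured data.
example_times = {1: 120.0, 2: 64.0, 4: 33.0, 8: 20.0}
```

With these placeholder values, maximal_speedup(example_times) returns (8, 6.0).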

6  Conclusions 

The  ratio  of  the  execution  time  of  a  concurrent  program  to  the  execution 
time  of  a  sequential  program  (for  the  same  problem)  depends  on  several 
factors: 

•  The  concurrent  and  sequential  algorithms  employed 

In our case, the conventional sequential discrete-event simulation
program is slower than the concurrent algorithm running on a single
processor. This is because the concurrent algorithm is tailored to the
problem (simulating a network of priority queues) whereas the
sequential algorithm is general. Therefore, in our analysis, we compared the
execution times of the concurrent algorithm running on computers
with different numbers of processors.

•  Computations  on  global  states 

In most physical systems, the behavior of a PP can affect the
behavior of all other PPs. (For instance, a job departing from a queue
may eventually arrive at each of the other queues.) Because of this
‘global’ effect, most distributed simulations carry out some
computation on the (global) state of the simulation; deadlock detection and
the computation of global virtual time [6] are examples of such
computations. Global computations require communication between LPs.
The greater the frequency of such computations, the greater the
communication overhead.

The implementation of our algorithm is semi-synchronous: the
algorithm consists of repeated executions of a loop in which LPs carry
out a simple global computation and then each LP carries out a
local computation. The efficiency of the algorithm depends primarily
on the amount of local computation in the loop: if there is a large
amount of local computation, the advantage of concurrency outweighs
the overhead of global computation. Therefore, the algorithm provides
speedup if the problem is large (i.e., it requires a great deal of
computation), the computation is load balanced, and the number of processors
is  small  enough  that  each  processor  has  a  significant  amount  of  local 
computation  to  carry  out.  In  our  experiments,  the  use  of  null  events 
increased  the  ratio  of  local  to  global  computation  a  hundred-fold.  A 
null  event  gives  a  lower  bound  on  the  time  of  the  next  event;  the  tighter 
the  bound,  the  better  the  ratio  of  local  to  global  computation. 

Better  speedup  was  obtained  for  preemptive-priority  networks  than  for 
first-come-first-served networks. This may seem counter-intuitive,
because the future behavior of a queue in a preemptive-priority network
is more dependent on the current behavior of other queues. (Jobs
depart a first-come-first-served queue in the order in which they arrive, no
matter  what  happens  at  other  queues,  whereas  in  a  preemptive-priority 
queue,  a  low-priority  job  is  preempted  by  the  arrival  of  a  high-priority 
job.)  The  speedup  for  preemptive-priority  networks  was  obtained  for 
two  reasons:  first,  simulating  preemptive-priority  networks  requires 
more  computation,  even  in  a  sequential  implementation;  second,  null 
events  are  used  to  determine  the  future  behavior  of  preemptive-priority 
queues. 

•  Algorithms  tailored  to  the  physical  system 

The speedup obtained in our experiments is due in part to the
algorithms being tailored to the problem: simulating a preemptive-priority
queueing network. We expect the algorithm to work well on problems
that have structures that can be exploited, such as circuit
simulation and simulations of classes of queueing networks. We expect
the algorithm to work poorly on problems that have no structure, that is,
problems in which any PP can communicate with any other PP.
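
The loop structure described under "Computations on global states" (a simple global computation followed by local computation in each LP) can be illustrated with a generic bounded-window sketch. This is not the conditional-event rule itself: the function run_rounds is hypothetical, events do not create new events here, and the lookahead parameter is an assumed stand-in for the lower bound that null events supply.

```python
def run_rounds(pending, lookahead):
    """Model the loop structure only; return (rounds, events_processed).

    pending: one list of event times per LP (hypothetical input).
    lookahead: assumed guarantee that nothing processed in a round can
        affect another LP earlier than (global minimum + lookahead).
    """
    queues = [sorted(times) for times in pending]
    rounds = 0
    processed = 0
    while any(queues):
        # Global computation: minimum next-event time over all LPs.
        global_min = min(q[0] for q in queues if q)
        bound = global_min + lookahead
        # Local computation: each LP handles every event up to the bound.
        for q in queues:
            while q and q[0] <= bound:
                q.pop(0)
                processed += 1
        rounds += 1
    return rounds, processed
```

With pending = [[1, 4, 7], [2, 5, 8], [3, 6, 9]], a lookahead of 0 needs 9 rounds to process 9 events, whereas a lookahead of 2 needs only 3 rounds. The number of global computations falls as the bound tightens, which is the effect attributed to null events above.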


In summary, it is possible to obtain significantly faster execution of
simulations on concurrent computers by matching the problem and the algorithm
to the computer.

Figure 1: N = 12, J = 12 x 10


Figure 2: N = 12, J = 12 x 50